The NYC flights dataset contains information about flights coming in and out of NYC. We can inspect the first few elements here.
Let’s say we would like to visualize this dataset using leaflet. The problem is that we have a bunch of airport codes but we don’t know what city they’re in It would be nice if we had it mapped out for us (e.g. LAX = Los Angeles) but unfortunately that is not the case.
Good news is that information is easily available on the web! One doesn’t have to look to far to see a table of airport codes and the city they belong to.
Looking at the site http://www.leonardsguide.com/us-airport-codes.shtml we see a nice table already made for us.
But how do I get that data though?
We can employ R’s awesome “rvest” package for web scraping whatever we’d like.
Webscraping is relatively intuitive. There only a few functions we need:
read_html() scrapes the programming from the site.
read_nodes() returns the data contained in certain HTML “nodes”
html_text() converts this to text
Let’s see what it looks like in action
Now the tricky part is seeing which nodes contain our data. Here it would be wise use chrome dev tools to see which node contains the info. Here it looks like thenode contains the data.
Let’s take a moment to recap:
But it is still not in a format we would like. It should be a dataframe with state, city, and airport code as columns instead of a long vector. Now comes the power of the tidyverse.
## # A tibble: 480 x 1
## codes
## <chr>
## 1 Alabama
## 2 AL
## 3 Birmingham International Airport
## 4 BHM
## 5 Dothan Regional Airport
## 6 DHN
## 7 Huntsville International Airport
## 8 HSV
## 9 Mobile
## 10 MOB
## # ... with 470 more rows
Looking at the above, at first glance it would seem difficult to get this into the data frame we want. We also can see one observation that we need to fix…
You can see that some airports have the word “Airport” attached to their name, while others do not. This makes it hard to distinguish which is an airport and what is not. We know that any 3 letter airport code must have a preceding airport in the element above it. Let’s write code to write the word “Airport” to any element that needs it.
Example:
We see that MOB’s airport name is “Mobile” and should be “Mobile Airport”
#loop through dataframe
for(i in 1:nrow(codes)){
#find 3 letter airport codes
if(nchar(codes[[i,1]]) == 3){
#check if the string above it contains the word "Airport"
if(!str_detect(codes[[i-1,1]], "Airport")){
#if it doesn't then write the word "Airport" at the end
codes[[i-1, 1]] <- paste(codes[[i-1,1]], "Airport", sep = " ")
}
}
}Now we can take full advantage of R’s separate function.
#use R's separate function as we need to separate 3 times
#first regex splits anything thats 2 letters or 3 letters or contains "Airport"
#now you see why we needed to add the word Airport to a few elements
regexp1 <- "(?=(^...$|^..$))|(?=.*(?<=Airport))"
#splits only airports
regexp2 <- "(?=.*(?<=Airport))"
#splits only 3 letters
regexp3 <- "(?=(^...$))"
#now we perform a sweeping data clean
codes <- codes %>%
separate(col = 1,into = c("State", "StateAbbr"),
sep = regexp1,extra = "merge") %>%
separate(col = 2,into = c("StateAbbr", "Airport"),
sep = regexp2,extra="merge") %>%
separate(col = 2, into = c("StateAbbr","AirportCode"),
sep = regexp3, extra = "merge") %>%
mutate(
StateAbbr = lead(StateAbbr),
AirportCode = lead(AirportCode, n = 3),
Airport = lead(Airport, n = 2)
) %>%
drop_na() %>%
na_if("") %>%
fill(State,StateAbbr)Now look at that gorgeous dataframe
## # A tibble: 185 x 4
## State StateAbbr AirportCode Airport
## <chr> <chr> <chr> <chr>
## 1 Alabama AL BHM "Birmingham International Airport"
## 2 Alabama AL DHN "Dothan Regional Airport"
## 3 Alabama AL HSV "Huntsville International Airport"
## 4 Alabama AL MOB "Mobile Airport"
## 5 Alabama AL MGM "Montgomery Airport"
## 6 Alaska AK ANC "Anchorage International Airport"
## 7 Alaska AK FAI "Fairbanks International Airport"
## 8 Alaska AK JNU "Juneau International Airport "
## 9 Arizona AZ FLG "Flagstaff Airport"
## 10 Arizona AZ PHX "Phoenix, Phoenix Sky Harbor International Air~
## # ... with 175 more rows
The next step is to join our custom dataframe to the nycflights dataframe.
We run into an issue however… the nycflights isn’t necessarily tidy. We’d ideally like to join on origin or destination, but which one? The problem here is that we can’t because it is not tidy. We need to adjust those columns such that we have the airport code in one column and the “type” (either origin or departure in the other).
tidy.flights <- flights %>%
pivot_longer(cols = c(origin,dest), names_to = "airport_code_type",
values_to = "AirportCode")
tidy.flights <- tidy.flights %>%
left_join(codes, by = "AirportCode") %>%
drop_na()Not done yet. To see flight data we need the latitude and longitude of the airports. Let’s scrape this site: www.dices.net/movil/airports/airports-United_States-US-1.html
The issue here is that our data is spread across multiple pages or URLs.
#need to use lapply with paste or paste0
coords.html <- lapply(
paste0(
"http://www.dices.net/movil/airports/airports-United_States-US-",1:103,".html"
),function(url){
url %>% #pass in URL
read_html() %>% #get html
html_nodes("b") %>% #get node data
html_text() #make it text
})
#returns only the the 5th-84th elements as the others are not needed
coords.html <- lapply(coords.html, function(x){x[5:84]})
#start extracting
coords.vec <- unlist(coords.html) #unlist into vector
AirportCodes <- str_extract(coords.vec, "^...$") #get 3 letter codes
latlong <- str_extract(coords.vec, "(\\d.*)|(-\\d.*)") #get the digits
#combine latlong into a single element
for(i in seq(3,(length(latlong)-1),4)){
latlong[i] <- paste0(latlong[i],",",latlong[i+1])
}
#extracts the combined latlong elements
latlong <- str_extract(latlong,".*,.*")
coords <- data.frame(AirportCode = AirportCodes, latlong = latlong) %>%
mutate(AirportCode = lead(AirportCode), latlong = lead(latlong, n = 2)) %>%
drop_na() %>%
separate(latlong, c("lat","long"), sep = ",") %>% #now we can split on the ","
mutate(lat = as.numeric(lat),
long = as.numeric(long)) #convert from stringNow we can finally join in Airport Code
Finally! Look at that dataframe
## # A tibble: 636,432 x 24
## AirportCode lat long year month day dep_time sched_dep_time dep_delay
## <chr> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
## 1 CAK 41.1 -81.5 2013 1 1 1147 1155 -8
## 2 CAK 41.1 -81.5 2013 1 1 2020 2030 -10
## 3 CAK 41.1 -81.5 2013 1 2 1147 1155 -8
## 4 CAK 41.1 -81.5 2013 1 2 2024 2030 -6
## 5 CAK 41.1 -81.5 2013 1 3 1151 1155 -4
## 6 CAK 41.1 -81.5 2013 1 3 2021 2030 -9
## 7 CAK 41.1 -81.5 2013 1 4 1156 1155 1
## 8 CAK 41.1 -81.5 2013 1 4 2029 2030 -1
## 9 CAK 41.1 -81.5 2013 1 5 1144 1145 -1
## 10 CAK 41.1 -81.5 2013 1 5 2023 2030 -7
## # ... with 636,422 more rows, and 15 more variables: arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, airport_code_type <chr>, State <chr>, StateAbbr <chr>,
## # Airport <chr>
We filter by “dest” and see all the flight destinations from NY. It is important to note that leaflet looks for lat/long columns automatically. Please zoom in and out to see the clusters